rawdata <- readr::read_csv("DATA2X02 class survey 2020 (Responses) - Form responses 1.csv")
data <- rawdata %>%
  janitor::clean_names() %>%
  mutate(
    timestamp = lubridate::dmy_hms(timestamp)
  ) %>%
  rename(
    covid = how_many_times_have_you_been_tested_for_covid,
    postcode = postcode_of_where_you_live_during_semester,
    dentist = how_long_has_it_been_since_you_last_went_to_the_dentist,
    uni_work = on_average_how_many_hours_per_week_did_you_spend_on_university_work_last_semester,
    social_media = what_is_your_favourite_social_media_platform,
    dog_cat = did_you_have_a_dog_or_a_cat_when_you_were_a_child,
    parents = do_you_currently_live_with_your_parents,
    exercise = how_many_hours_a_week_do_you_spend_exercising,
    eye_colour = what_is_your_eye_colour,
    asthma = do_you_have_asthma,
    paid_work = on_average_how_many_hours_per_week_did_you_work_in_paid_employment_in_semester_1,
    season = what_is_your_favourite_season_of_the_year,
    shoe = what_is_your_shoe_size,
    height = how_tall_are_you,
    floss = how_often_do_you_floss_your_teeth,
    glasses = do_you_wear_glasses_or_contacts,
    hand = what_is_your_dominant_hand,
    steak = how_do_you_like_your_steak_cooked,
    stress = on_a_scale_from_0_to_10_please_indicate_how_stressed_you_have_felt_in_the_past_week
  ) %>%
  mutate(
    gender = case_when(
      grepl("f", tolower(gender), fixed=T) ~ "Female",
      grepl("m", tolower(gender), fixed=T) ~ "Male",
      TRUE ~ "Non-binary"
    ) %>% as.factor(),
    dentist = factor(dentist, levels = c("Less than 6 months", "Between 6 and 12 months", 
                                             "Between 12 months and 2 years", "More than 2 years", NA),
                         ordered = T) %>% 
            recode(
                `Less than 6 months` = "< 6 months",
                `Between 6 and 12 months` = "6 - 12 months",
                `Between 12 months and 2 years` = "12 months - 2 years", 
                `More than 2 years` = "> 2 years"
            ),
    dog_cat = as_factor(dog_cat),
    parents = as_factor(parents),
    postcode = as.factor(postcode),
    asthma = as_factor(asthma),
    season = factor(season, levels = c("Summer", "Autumn", "Winter", "Spring"), ordered = T),
    floss = factor(floss, levels = c("Less than once a week", "Weekly", "Most days", 
                                        "Every day", NA), ordered = T),
    glasses = as_factor(glasses),
    hand = case_when(
      hand == "Right handed" ~ "Right",
      hand == "Left handed" ~ "Left",
      TRUE ~ "Ambidextrous"
    ) %>% as_factor(),
    steak = factor(steak, levels = c(
      "Rare", "Medium-rare", "Medium", "Medium-well done", "Well done",
      "I don't eat beef"
    ), ordered = T)
  )

1 Experiment design

1.1 Bias

  • Selection bias / sampling bias: the sample does not accurately represent the population. (e.g. as the class survey was published on Ed, students who spend more time checking Ed posts are more likely to complete the survey.)

  • Non-response bias: certain groups are under-represented because they elect not to participate

  • Measurement or design bias: the design of the measurement instrument or sampling method itself influences the data obtained.

1.2 Controlled experiments vs Observational studies

Randomized controlled double-blind study:

  • the investigator randomly allocates subjects into a treatment group and a control group. The control group is given a placebo, and neither the subjects nor the investigators know which group is which.

  • the design is good because we expect the two groups to be similar, so any difference in the responses is likely to be caused by the treatment.

controlled vs observational:

  • A good randomized controlled experiment can establish causation

  • An observational study can only establish association. It may suggest causation but can’t prove causation.

1.3 Confounding

Confounding occurs when the treatment group and control group differ by some third variable (other than the treatment) which influences the response being studied.

  • if any of the subjects drop out, selection bias or survivor bias can arise.
  • if not all subjects keep taking the treatment or placebo, confounding between adherers and non-adherers occurs.

Controlling for confounding: make groups more comparable by dividing them into subgroups with respect to the confounders. (e.g. if alcohol consumption is a potential confounding factor, then divide subjects into heavy drinkers, medium drinkers and light drinkers)

  • limitation of controlling:
    • this can be limited by our ability to identify all confounders and then divide the study by the confounders.
    • This explains why it took so long to establish that smoking causes lung cancer. Researchers needed to control for factors such as health, diet, lifestyle, environment etc.

1.4 Simpson’s Paradox

  • A clear trend in individual groups of data disappears when the groups are pooled together.
  • It occurs when relationships between percentages in subgroups are reversed when the subgroups are combined, because of a confounding or lurking variable.

2 Performance evaluation

  • \(D_+\): the event that an individual has a particular disease
  • \(D_-\): the event that an individual does not have a particular disease
  • \(S_+\): represent a positive screening test result.
  • \(S_-\): represent a negative screening test result.

  • False negative rate: \(\frac{FN}{TP+FN}\)
  • False positive rate: \(\frac{FP}{FP+TN}\)
  • Sensitivity / Recall: \(\frac{TP}{TP+FN}\)
  • Specificity: \(\frac{TN}{TN+FP}\)
  • Precision / Positive predictive value: \(\frac{TP}{TP+FP}\)
  • Negative predictive value: \(\frac{TN}{TN+FN}\)
  • Accuracy: \(\frac{TP+TN}{TP+FN+FP+TN}\)
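These definitions are mechanical once the four cells of the confusion matrix are fixed. A minimal Python sketch (the counts are hypothetical):

```python
# Performance metrics from the four confusion-matrix cells.
# The counts below are hypothetical, for illustration only.
tp, fp, tn, fn = 40, 10, 45, 5

sensitivity = tp / (tp + fn)          # recall / true positive rate
specificity = tn / (tn + fp)          # true negative rate
precision   = tp / (tp + fp)          # positive predictive value
npv         = tn / (tn + fn)          # negative predictive value
fnr         = fn / (tp + fn)          # false negative rate = 1 - sensitivity
fpr         = fp / (fp + tn)          # false positive rate = 1 - specificity
accuracy    = (tp + tn) / (tp + fp + tn + fn)

print(sensitivity, specificity, precision, accuracy)
```

Note the complementary pairs: the false negative rate is \(1-\) sensitivity and the false positive rate is \(1-\) specificity.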

3 Measure of risk

Prospective / cohort study: subjects are initially identified as disease-free and classified by presence or absence of a risk factor.

  • one sample from the risk factor group \(R^+\)
  • another sample from the non-risk factor group \(R^-\)

Retrospective / case control study: take random samples from each of the two outcome categories which are followed back to determine the presence or absence of the risk factor.

  • one sample from the disease group \(D^+\)
  • another sample from the non-disease group \(D^-\)

Relative risk: The relative risk is the ratio of the probability of having the disease in the group with the risk factor to the probability of having the disease in the group without the risk factor. \(RR=\frac{P(D^+|R^+)}{P(D^+|R^-)}\)

  • For Prospective study only.
  • Not for Retrospective study because the proportions of cases with \(D^+\) and \(D^-\) were decided by the investigator, which means we cannot estimate \(P(D^+|R^+)\) and \(P(D^+|R^-)\).
  • RR = 1: there is no difference between the two groups; RR < 1: the disease is less likely to occur in the group with the risk factor; RR > 1: the disease is more likely to occur in the group with the risk factor.

Odds ratio:

  • for both Prospective and Retrospective studies
  • OR = 1 if and only if the risk factor and disease are independent; OR > 1: the disease is more likely to occur in the group with the risk factor; OR < 1: the disease is less likely to occur in the group with the risk factor.
  • standard error of \(\log(\hat{OR})\): \(SE=\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}\)
  • 95% confidence interval: \(CI = \exp\left(\log(\hat{OR})\pm 1.96\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}\right)\); if the CI includes 1, the data are consistent with the risk factor and disease being independent.
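The point estimate, standard error and confidence interval can be computed directly from the four cell counts. A Python sketch (the 2×2 counts are hypothetical):

```python
import math

# Hypothetical 2x2 table:        disease+  disease-
# risk factor present (a, b), risk factor absent (c, d).
a, b, c, d = 20, 80, 10, 90

or_hat = (a * d) / (b * c)                     # odds ratio estimate
se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)      # SE of log(OR)
lo = math.exp(math.log(or_hat) - 1.96 * se_log)
hi = math.exp(math.log(or_hat) + 1.96 * se_log)

print(or_hat, (lo, hi))   # if the CI includes 1, independence is plausible
```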

4 Random variables

Expectations of random variable:

  • \(E(X)=\mu\) - \(\mu\) is the population mean
  • \(Var(X)=\sigma^2\) - \(\sigma^2\) is the population variance
  • \(SD(X)=\sigma\)
  • \(E(cX)=cE(X)\)
  • \(Var(cX)=c^2Var(X)\)

For \(T=\sum^n_{i=1}X_i\):

  • \(E(T)=E(X_1+...+X_n)=E(X_1)+...+E(X_n)=\mu+...+\mu=n\mu\)
  • \(Var(T)=Var(X_1+...+X_n)=Var(X_1)+...+Var(X_n)=\sigma^2+...+\sigma^2=n\sigma^2\)

Sampling:

  • sample mean: \(E(\bar X)=E(\frac{1}{n}T)=\frac{1}{n}E(T)=\frac{1}{n}n\mu=\mu\).
  • variance of the sample mean: \(Var(\bar X)=Var(\frac{1}{n}T)=(\frac{1}{n})^2 Var(T)=\frac{1}{n^2}n\sigma^2=\frac{\sigma^2}{n}\)
  • standard error: \(SE=SD(\bar X)=\sqrt{Var(\bar X)}=\frac{\sigma}{\sqrt n}\)
  • estimated SE: \(\hat{SE}=\frac{s}{\sqrt n}\), where \(s^2=\frac{1}{n-1}\sum^n_{i=1}(x_i-\bar x)^2\)

Importance of SE: it tells us the likely size of the estimation error so that we know how accurate or reliable the estimate is.
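The result \(SD(\bar X)=\sigma/\sqrt n\) is easy to check by simulation. A Python sketch using only the standard library (the population parameters and sample size are arbitrary choices):

```python
import random
import statistics

random.seed(1)
mu, sigma, n, reps = 10.0, 2.0, 25, 20000

# Draw many samples of size n and record each sample mean.
means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(reps)]

# The SD of the sample means should be close to sigma / sqrt(n) = 0.4.
se_hat = statistics.stdev(means)
print(se_hat)
```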

5 Confidence Intervals

Confidence interval: a statement about the underlying population parameter (NB: confidence \(\neq\) probability).

  • As sample size increases: 1) CI gets narrower; 2) Standard error decreases;

  • 99% CI is wider than 95% CI
  • Coverage probability = \(1-\alpha\): the probability that the “true” value of the unknown parameter lies inside the confidence interval.

Interpret Confidence interval:

  • It is a range of plausible values for the underlying population parameter.
  • If you perform a large number of identically designed experiments and for each experiment you calculated a 95% confidence interval then 95% of those intervals would contain the true population mean.
  • 95% confidence that a single interval contains the true mean.
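The repeated-experiments interpretation can be illustrated by simulation: build a 95% interval from each of many samples and count how often the true mean is covered. A Python sketch (a z-interval with known \(\sigma\) is used for simplicity; all numbers are arbitrary):

```python
import math
import random
import statistics

random.seed(42)
mu, sigma, n, reps = 50.0, 5.0, 30, 10000
half = 1.96 * sigma / math.sqrt(n)   # z-interval half-width (sigma known)

covered = 0
for _ in range(reps):
    xbar = statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    if xbar - half <= mu <= xbar + half:
        covered += 1

coverage = covered / reps
print(coverage)   # close to 0.95
```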

Estimation vs Hypothesis testing:

  • Estimation: a population parameter is unknown. Use the sample statistics to generate estimates of the population parameter.
  • Hypothesis testing: explicit statement regarding the population parameter. Test statistics are generated which will either support or reject the null hypothesis.

6 Critical value

If the observed test statistic \(t_0\) is more extreme than the critical value, reject \(H_0\).

Two-sided discrepancies of interest:

  • t-test: \(|\bar x-\mu_0|>c\frac{s}{\sqrt n}\)
  • Confidence interval: \(\bar x \pm c\frac{s}{\sqrt n}\)

False alarm rate / Significance level: the probability that a true value \(\mu_0\) is rejected incorrectly.

Normal population: use the t-distribution: under the special statistical model where the data are modeled as values taken by iid normal random variables, if the true population mean is indeed \(\mu_0\), then the ratio \(\frac{\bar X-\mu_0}{S/ \sqrt n}\sim t_{n-1}\). Thus we choose \(c\) such that \(P(|\bar X-\mu_0|>c\frac{S}{\sqrt n})=P(\frac{|\bar X-\mu_0|}{S/\sqrt n}>c)=P(|t_{n-1}|>c)=\alpha\).

Reject \(H_0\): the observed test statistic \(t_0\) is more extreme than the critical value.

6.2 Critical value decision rule

For a test of \(H_0:\mu=\mu_0\):

  • for \(H_1:\mu>\mu_0\), reject \(H_0\) if \(t_0\geq t_{n-1}(1-\alpha)\)
  • for \(H_1:\mu<\mu_0\), reject \(H_0\) if \(t_0\leq t_{n-1}(\alpha)\)
  • for \(H_1:\mu\neq\mu_0\), reject \(H_0\) if \(|t_0|\geq t_{n-1}(1-\alpha/2)\); don’t reject \(H_0\) if \(|t_0|< t_{n-1}(1-\alpha/2)\)

6.3 Hypothesis test using rejection region

Hypothesis: \(H_0:\mu=\mu_0\) vs \(H_1:\mu\neq \mu_0,\mu<\mu_0,\mu>\mu_0\)

Assumptions: \(X_i\) are independently and identically distributed and follow \(N(\mu,\sigma^2)\)

Test statistic:

  • \(\sigma\) is unknown: \(T=\frac{\bar X-\mu_0}{S/ \sqrt n}\). Under \(H_0\), the test statistic follows a t distribution with \(n-1=137\) degrees of freedom.
  • \(\sigma\) is known: \(Z=\frac{\bar X-\mu_0}{\sigma/ \sqrt n}\). Under \(H_0\), the test statistic follows a standard normal distribution \(N(0,1)\).

\(\mu_0=130, df = 137, S=15.78, n = 138\)

Rejection region: \(critical~value = t_{n-1}(\alpha)\) (one-sided) or \(t_{n-1}(\alpha/2)\) (two-sided)

  • \(\sigma\) is unknown, for \(H_1:\mu\neq \mu_0,\ \mu<\mu_0,\ \mu>\mu_0\) respectively: \(|t_0|\geq |t_{n-1}(\alpha/2)|\), \(t_0\leq t_{n-1}(\alpha)\), \(t_0\geq t_{n-1}(1-\alpha)\)
  • \(\sigma\) is known, for \(H_1:\mu\neq \mu_0,\ \mu<\mu_0,\ \mu>\mu_0\) respectively: \(|z_0|\geq |z(\alpha/2)|\), \(z_0\leq z(\alpha)\), \(z_0\geq z(1-\alpha)\)

Decision:

  • The observed test statistic, \(t_0=..\), is smaller than the critical value \(t_{n-1}(\alpha)=..\). So we do not reject \(H_0\) and conclude that \(\mu=..\).
  • The observed test statistic, \(t_0=..\), is greater than the critical value \(t_{n-1}(\alpha)=..\). So we reject \(H_0\) and conclude that \(\mu\neq/>/<..\).

8 Goodness of fit test

8.0.1 Are the data obtained in line with the claim?

Hypothesis: \(H_0:p_1=p_{10},p_2=p_{20},...,p_k=p_{k0}\) vs \(H_1:\) at least one equality does not hold.

Assumptions:

  • independent observations
  • expected counts are all greater than 5 (i.e. \(e_i=np_{i0}\geq 5\))

Test statistic: \(T = \sum_{i=1}^k\frac{(Y_{i} - e_{i})^2}{e_{i}}\). Under \(H_0\), the test statistic follows a chi-squared distribution with \(k-1-q\) degrees of freedom, where \(q\) is the number of parameters estimated from the data (here \(q=0\)).

Observed test statistic: \(t_0 = \sum_{i=1}^k\frac{(y_{i} - e_{i})^2}{e_{i}}\) = 0.1

## 
##  Chi-squared test for given probabilities
## 
## data:  y_i
## X-squared = 0.096342, df = 2, p-value = 0.953

P-value: \(P(T\geq t_0)=P(\chi^2_{k-1-q}\geq 0.1) = 0.953\)

Decision:

  • Since the p-value < 0.05, there is strong evidence in the data against \(H_0\).
  • Since the p-value > 0.05, the null hypothesis is not rejected.
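The whole calculation can be reproduced by hand. A Python sketch with hypothetical counts and a uniform null; for df = 2 the chi-squared survival function has the closed form \(P(\chi^2_2\geq t)=e^{-t/2}\):

```python
import math

# Hypothetical observed counts for k = 3 equally likely categories.
y = [34, 30, 36]
n = sum(y)
e = [n / len(y)] * len(y)                  # expected counts n * p_i0 (all >= 5)

t0 = sum((yi - ei) ** 2 / ei for yi, ei in zip(y, e))
df = len(y) - 1                            # k - 1 (no parameters estimated)
p = math.exp(-t0 / 2)                      # chi-squared survival function, df = 2

print(t0, df, p)
```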

8.0.2 Whether the data follows a Poisson distribution

Using the fact that for a Poisson distribution \[X \sim \text{Poisson}(\lambda) \Longrightarrow E[X] = \lambda,\] the parameter is estimated by the sample mean: \[ \hat{\lambda} = \bar{x} = \frac{1}{n}\sum_{k=1}^n x_k \]

Hypothesis: \(H_0:p_1=p_{10},p_2=p_{20},...,p_k=p_{k0}\) vs \(H_1:\) at least one equality does not hold.

Assumptions:

  • independent observations
  • expected counts are all greater than 5 (i.e. \(e_i=np_{i0}\geq 5\))

Test statistic: \(T = \sum^k_{i=1}\frac{(Y_i-np_i)^2}{np_i}\). Under \(H_0\), the test statistic follows a chi-squared distribution with \(k-1-q\) = 1 degrees of freedom, where \(k\) is the number of groups and \(q\) is the number of parameters estimated from the data.

Observed test statistic: With the observed frequencies \(y_i\) from the data and estimated parameter \(\lambda\) = 0.5362, \(t_0\) = 15.63

P-value: \(P(T\geq t_0)=P(\chi^2_{1}\geq 15.63) = 10^{-4}\)

Decision

  • Since the p-value is greater than 0.05, we do not reject the null hypothesis. The data are consistent with a Poisson distribution.

  • Since the p-value is smaller than 0.05, we reject the null hypothesis. The data do not follow a Poisson distribution.
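The p-value above can be recovered using only the normal distribution, since for df = 1, \(P(\chi^2_1\geq t)=2P(Z\geq\sqrt t)\). A Python check of the reported numbers:

```python
import math
from statistics import NormalDist

t0, df = 15.63, 1
# A chi-squared variable with df = 1 is the square of a standard normal,
# so P(chi2_1 >= t0) = 2 * P(Z >= sqrt(t0)).
p = 2 * (1 - NormalDist().cdf(math.sqrt(t0)))

print(p)   # on the order of 1e-4, matching the reported p-value
```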

9 Test of homogeneity - Chi-squared test

Test of homogeneity: Test whether the probability distributions of the categorical feature are the same over the different populations.

Hypothesis: \(H_0:p_{1j}=p_{2j}\) for \(j=1,2,3\) vs \(H_1:\) not all equalities hold.

Assumptions: \(e_{ij}=\frac{y_iy_j}{n} \geq 5\).

Test statistic: \(T=\sum^r_{i=1}\sum^c_{j=1}\frac{(Y_{ij}-e_{ij})^2}{e_{ij}}\). Under \(H_0\), the test statistic follows a chi-squared distribution with \((r-1)(c-1)\) = 2 degrees of freedom, where \(r\) is the number of rows and \(c\) is the number of columns in the contingency table.

## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 11.635, df = 2, p-value = 0.002975

Observed test statistic: \(t_0=\sum^{2}_{i=1}\sum^{3}_{j=1}\frac{(y_{ij}-e_{ij})^2}{e_{ij}}=11.63\)

P-value: \(P(T \geq 11.63) = P(\chi^2_{2} \geq 11.63) = 0.003\)

Decision:

  • Since the p-value is greater than 0.05, we do not reject the null hypothesis. There is no significant difference in … between … and …
  • Since the p-value is smaller than 0.05, we reject the null hypothesis. There is a significant difference in … between … and …
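The expected counts come straight from the margins, \(e_{ij}=\frac{y_{i\cdot}y_{\cdot j}}{n}\). A Python sketch on a hypothetical 2×3 table:

```python
# Hypothetical 2x3 contingency table (rows = populations, cols = categories).
tab = [[10, 20, 30],
       [20, 20, 20]]

n = sum(map(sum, tab))
row = [sum(r) for r in tab]              # row totals
col = [sum(c) for c in zip(*tab)]        # column totals

# Expected counts under H0 and the chi-squared statistic.
t0 = sum((tab[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
         for i in range(len(tab)) for j in range(len(tab[0])))
df = (len(tab) - 1) * (len(tab[0]) - 1)

print(t0, df)
```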

10 Test of independence - Chi-squared test

Test of independence: observations are collected from one population and two categorical variables are observed from each unit.

Hypothesis: \(H_0:p_{ij}=p_ip_j\) for \(i=1,2,...,r,j=1,2,...,c\) vs \(H_1:\) Not all equalities hold.

Assumptions: all expected counts are greater or equal to 5 (i.e.\(e_{ij}=\frac{y_iy_j}{n} \geq 5\))


Test statistic: \(T=\sum^r_{i=1}\sum^c_{j=1}\frac{(Y_{ij}-e_{ij})^2}{e_{ij}}\). Under \(H_0\), the degree of freedom is \((r-1)(c-1)=\) 2, where c is the number of columns and r is the number of rows in the contingency table.

Observed statistic: \(t_0=\sum^{2}_{i=1}\sum^{3}_{j=1}\frac{(y_{ij}-{y_iy_j/n})^2}{y_iy_j/n}=\) 11.63

P-value: \(P(T\geq 11.63) = P(\chi^2_{2}\geq 11.63) = 0.003\)

Decision:

  • Since the p-value is greater than 0.05, we do not reject the null hypothesis. There is no association between … and …
  • Since the p-value is smaller than 0.05, we reject the null hypothesis. There is an association between … and …

11 Test in small samples (cell counts < 5)

11.1 Fisher’s exact test - 2 by 2 table

Fisher’s exact test: Consider all possible permutations of 2 by 2 contingency table with the same marginal totals. Then calculate how many of these were equal to or “more extreme” than what we observed. As such, the Fisher’s exact test does not require the expected cell counts to be \(\geq 5\).

Drawbacks:

  • It assumes that row and column margins are fixed.
  • Need to use Monte Carlo for large contingency tables.

Hypothesis: \(H_0:\) there is no association between … and … vs \(H_1:\) there is an association between … and …

Assumptions: the row and column margins are fixed. The expected cell counts are not required to be \(\geq 5\), since the test is exact.

## 
##  Fisher's Exact Test for Count Data
## 
## data:  tab
## p-value = 0.00157
## alternative hypothesis: greater

Degrees of freedom are not calculated, as they are not relevant to Fisher’s exact test.

Decision:

  • P-value = 0.0016 \(<0.05\), \(H_0\) is rejected therefore there is an association between … and …
  • P-value = 0.0016 \(>0.05\), \(H_0\) is not rejected therefore there is no association between … and …
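The exact p-value is a hypergeometric tail sum over tables with the observed margins, which can be written with binomial coefficients. A Python sketch for a hypothetical 2×2 table and the “greater” alternative:

```python
from math import comb

# Hypothetical 2x2 table: [[a, b], [c, d]] with fixed margins.
a, b, c, d = 8, 2, 1, 5
r1, c1, n = a + b, a + c, a + b + c + d   # row-1 total, col-1 total, grand total

# One-sided (greater) p-value: sum hypergeometric probabilities over all
# tables with the same margins whose top-left cell is >= the observed a.
hi = min(r1, c1)
p = sum(comb(c1, x) * comb(n - c1, r1 - x) for x in range(a, hi + 1)) / comb(n, r1)

print(p)
```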

11.2 Yates’ corrected chi-squared test

Yates’ correction: apply a continuity correction with a chi-squared test, using the identity \(P(X\leq x) \approx P(Y\leq x+0.5)\) and \(P(X\geq x) \approx P(Y\geq x-0.5)\).

Hypothesis: \(H_0:\) there is no association between … and … vs \(H_1:\) there is an association between … and …

Assumption: Yates’ correction applies a continuity correction to approximate an integer-valued variable by a continuous one; therefore it does not restrict the cell counts.

## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 11.635, df = 2, p-value = 0.002975

Test statistic: \(T=\sum^r_{i=1}\sum^c_{j=1}\frac{(|Y_{ij}-e_{ij}|-0.5)^2}{e_{ij}}\), which approximately follows a \(\chi^2_{(r-1)(c-1)}\) distribution under \(H_0\). The degree of freedom is \((r-1)(c-1)=\) 2, where r is the number of rows and c is the number of columns in the contingency table.

Observed test statistic: \(t_0=\sum^{2}_{i=1}\sum^{3}_{j=1}\frac{(|y_{ij}-e_{ij}|-0.5)^2}{e_{ij}}\) = 11.63.

P-value: \(P(T\geq 11.63) = P(\chi^2_{2}\geq 11.63) = 0.003\)

Decision:

  • As p-value \(<0.05\), \(H_0\) is rejected therefore there is an association between … and …
  • As p-value \(>0.05\), \(H_0\) is not rejected therefore there is no association between … and …
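Subtracting 0.5 from each absolute deviation before squaring always shrinks the statistic relative to the uncorrected version. A Python sketch on a hypothetical 2×2 table:

```python
# Hypothetical 2x2 table.
tab = [[12, 8],
       [6, 14]]

n = sum(map(sum, tab))
row = [sum(r) for r in tab]
col = [sum(c) for c in zip(*tab)]
e = [[row[i] * col[j] / n for j in range(2)] for i in range(2)]

# Uncorrected and Yates-corrected chi-squared statistics.
plain = sum((tab[i][j] - e[i][j]) ** 2 / e[i][j] for i in range(2) for j in range(2))
yates = sum((abs(tab[i][j] - e[i][j]) - 0.5) ** 2 / e[i][j]
            for i in range(2) for j in range(2))

print(plain, yates)   # the corrected statistic is the smaller of the two
```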

11.3 Monte Carlo simulation

Monte Carlo simulation: Resample (i.e. randomly generate contingency tables with the same margins) many times. Calculate the test statistic for each resample to build a sampling distribution of test statistics. The p-value is the proportion of resampled test statistics \(\geq\) the observed test statistic.

Hypothesis: \(H_0:\) there is no association between … and … vs \(H_1:\) there is an association between … and …

Assumptions: No assumptions are made about the underlying distribution of the population. The cell counts are also not restricted for Monte Carlo simulation.

To calculate the p-value, a Monte Carlo simulation is performed, with 10000 simulations of the chi-squared test. Note that degrees of freedom are not considered, as they are not relevant to the Monte Carlo simulation.

## 
##  Pearson's Chi-squared test with simulated p-value (based on 10000
##  replicates)
## 
## data:  tab
## X-squared = 11.635, df = NA, p-value = 0.0012

Test statistics: the test statistic is calculated for each of the resamples by \(T=\sum^r_{i=1}\sum^c_{j=1}\frac{(Y_{ij}-e_{ij})^2}{e_{ij}}\).

Observed test statistic: \(t_0=\sum^r_{i=1}\sum^c_{j=1}\frac{(y_{ij}-e_{ij})^2}{e_{ij}}=\) 11.63

P-value: the proportion of simulated test statistics \(\geq 11.63\), giving \(P(T\geq 11.63)=0.0012\)

Decision:

  • As p-value \(<0.05\), \(H_0\) is rejected therefore there is an association between … and …
  • As p-value \(>0.05\), \(H_0\) is not rejected therefore there is no association between … and …
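The resampling scheme can be sketched as a permutation test: keep the margins fixed by shuffling the category labels, rebuild the table, and recompute the statistic each time. A standard-library Python sketch (the table and number of simulations are arbitrary):

```python
import random

def chi2_stat(tab):
    """Pearson chi-squared statistic for a contingency table."""
    n = sum(map(sum, tab))
    row = [sum(r) for r in tab]
    col = [sum(c) for c in zip(*tab)]
    return sum((tab[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
               for i in range(len(tab)) for j in range(len(tab[0])))

random.seed(0)
tab = [[10, 20, 30],
       [20, 20, 20]]
t0 = chi2_stat(tab)

# Expand the table into paired (row label, column label) observations.
rows = [i for i, r in enumerate(tab) for cnt in r for _ in range(cnt)]
cols = [j for r in tab for j, cnt in enumerate(r) for _ in range(cnt)]

hits, reps = 0, 2000
for _ in range(reps):
    random.shuffle(cols)                       # break any association; margins stay fixed
    sim = [[0] * len(tab[0]) for _ in tab]
    for i, j in zip(rows, cols):
        sim[i][j] += 1
    if chi2_stat(sim) >= t0:
        hits += 1

print(t0, hits / reps)   # observed statistic and Monte Carlo p-value
```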

12 T-test

12.1 One sample t-test

The data may not follow a normal distribution if, on a QQ-plot:

  • there is a lower bound at zero (suggesting skewness), or
  • the points are not close enough to the line to be considered ‘normal’.

Hypothesis: \(H_0: \mu =30\) vs \(H_1: \mu>30\), \(\mu<30\) or \(\mu\neq 30\)

Assumptions:

  • Each observation is chosen at random from a population.
  • Variables are independently and identically distributed and follow \(N(\mu,\sigma^2)\)

Test statistic: \(T=\frac{\bar{X}-\mu_0}{S/\sqrt{n}}\). Under \(H_0\), the test statistic follows a t distribution with \(n-1\) degrees of freedom.

Observed test statistic: \(t_0=\frac{28.34-30}{15.78/\sqrt{138}}=-1.24\)

P-value: \(2P(t_{137}\leq -1.24)=0.2188\)

Decision:

  • If the p-value is smaller than 0.05, we reject the null hypothesis. The mean of … is not equal to / greater than / less than 30.
  • If the p-value is greater than 0.05, we do not reject the null hypothesis. The data are consistent with the mean of … being equal to 30.
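The observed statistic and an approximate two-sided p-value can be reproduced from the summary numbers above; with df = 137 the t distribution is close to \(N(0,1)\), so a stdlib-only normal approximation is adequate:

```python
import math
from statistics import NormalDist

xbar, mu0, s, n = 28.34, 30, 15.78, 138
t0 = (xbar - mu0) / (s / math.sqrt(n))

# df = 137 is large, so t_137 is well approximated by N(0, 1).
p_two_sided = 2 * NormalDist().cdf(-abs(t0))

print(t0, p_two_sided)   # roughly -1.24 and 0.22
```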

12.2 Two-sample t-test

Two-sample t-test: test whether the population mean of two samples are different.

  • The two-sample t-test is a parametric test where the test statistic is assumed to follow some distribution.

Welch two-sample t-test: does not assume equal population variances.

The data may not follow a normal distribution if, on a QQ-plot:

  • there is a lower bound at zero (suggesting skewness), or
  • the points are not close enough to the line to be considered ‘normal’.

Hypothesis: \(H_0:\mu_x=\mu_y\) vs \(H_1:\mu_x >\mu_y\), \(\mu_x < \mu_y\) or \(\mu_x \neq \mu_y\)

Assumptions:

  • Variables \(X, Y\) are independently and identically distributed and follow \(N(\mu_X,\sigma^2)\) and \(N(\mu_Y,\sigma^2)\) respectively.
  • Observations are independent of each other.
  • The regular two-sample t-test assumes equal population variances; the Welch two-sample t-test does not.

Test statistic: \(T=\frac{\bar{X}-\bar{Y}}{S_p \sqrt{\frac{1}{n_x}+\frac{1}{n_y}}}\), where \(S^2_p=\frac{(n_x-1)S^2_x+(n_y-1)S^2_y}{n_x+n_y-2}\). Under \(H_0\), the test statistic follows a t distribution with \(n_x+n_y-2\) degrees of freedom.

Observed test statistic: \(t_0=\frac{4.48 - 4.48}{3.32 \sqrt{\frac{1}{138}+\frac{1}{138}}}=0\), where \(s^2_p=\frac{(138 -1)3.32^2+(138-1)3.32^2}{138+138 -2}=3.32^2\)

P-value:

  • \(H_1:\mu_x \neq \mu_y\): \(2P(t_{274} \geq |t_0|)=2P(t_{274}\geq 0)=1\)
  • \(H_1:\mu_x < \mu_y\): \(P(t_{274} \leq 0)=0.5\)
  • \(H_1:\mu_x > \mu_y\): \(P(t_{274} \geq 0)=0.5\)

Decision:

  • As the p-value is greater than 0.05, we do not reject the null hypothesis. There is no evidence that the population means of the two samples differ.
  • As the p-value is smaller than 0.05, we reject the null hypothesis. The population means of the two samples are different.
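The pooled variance and the statistic follow directly from the per-sample summaries. A Python sketch with hypothetical summary statistics:

```python
import math

# Hypothetical summary statistics for two samples.
nx, xbar, sx = 20, 5.2, 1.3
ny, ybar, sy = 20, 4.1, 1.5

sp2 = ((nx - 1) * sx**2 + (ny - 1) * sy**2) / (nx + ny - 2)  # pooled variance
t0 = (xbar - ybar) / (math.sqrt(sp2) * math.sqrt(1/nx + 1/ny))
df = nx + ny - 2

print(t0, df)
```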

12.3 Paired samples t-test

Paired samples t-test: the same subjects are measured twice (e.g. blood samples from individuals before and after they smoke a cigarette). The differences between the two measurements are then used in a one-sample t-test.

Hypothesis:\(H_0:\mu_d=0\) vs \(H_1:\mu_d\neq 0\)

Assumptions: the differences between the two measurements are independent and identically distributed, following \(N(\mu_d,\sigma^2_d)\).

Test statistic: \(T=\frac{\bar D}{S_d / \sqrt{n}}\). Under \(H_0\), test statistic tends to follow a t distribution with \(n-1\) degree of freedom.

Observed test statistic: \(t_0=\frac{8.78}{9.98/ \sqrt{9}}=2.64\)

P-value:

  • \(H_1:\mu_d \neq 0\): \(2P(t_{8} \geq |2.64|)=0.0298\)
  • \(H_1:\mu_d > 0\): \(P(t_{8} \geq 2.64)=0.0149\)
  • \(H_1:\mu_d < 0\): \(P(t_{8} \leq 2.64)=0.9851\)

Decision:

  • As the p-value is smaller than 0.05, we reject the null hypothesis. The population means of the two measurements are different.
  • As the p-value is greater than 0.05, we do not reject the null hypothesis. There is no evidence that the population means of the two measurements differ.
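With the summary values above (\(\bar d=8.78\), \(s_d=9.98\), \(n=9\)), the observed statistic is a one-line computation:

```python
import math

dbar, sd, n = 8.78, 9.98, 9
t0 = dbar / (sd / math.sqrt(n))   # one-sample t statistic on the differences

print(t0)   # about 2.64, with n - 1 = 8 degrees of freedom
```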

13 Sign test / Binomial test - when normality is violated

Sign test: is used to test \(H_0:\mu = \mu_0\) and paired data when normality is not satisfied.

  • If \(H_0\) is true, the probability \(p_+\) of getting a positive difference \(D_i=X_i-\mu_0\) is \(\frac{1}{2}\).
  • The sign test reduces to a binomial test of proportions.
  • The sign test is a non-parametric test as no assumption on the data distribution is made except symmetry.
  • Drawback: it ignores all the information on magnitude and hence has low power.

13.1 Sign test for one-sample mean

## 
##  Exact binomial test
## 
## data:  freq
## number of successes = 65, number of trials = 117, p-value = 0.1336
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
##  0.4753134 1.0000000
## sample estimates:
## probability of success 
##              0.5555556

Hypothesis: \(H_0:\mu=30\) vs \(H_1:\mu>30, \mu<30,\mu \neq 30\)

Assumptions: Observations are independently sampled from a symmetric distribution.

Test statistic: \(T=\text{number of }(D_i>0)\), where \(D_i=X_i-30\). Under \(H_0\), the test statistic follows a binomial distribution \(B(n,\frac{1}{2})\), where \(n\) is the number of non-zero differences.

Observed test statistic: \(t_0=\text{number of }(d_i>0)=65\)

P-value:

  • \(H_1:\mu<\mu_0\) - \(P(T \leq t_0)\)
  • \(H_1:\mu>\mu_0\) - \(P(T \geq t_0)=P(T\geq 65)=0.1336\)
  • \(H_1:\mu \neq \mu_0\) & \(t_0<\frac{n}{2}\) - \(2P(T \leq t_0)\)
  • \(H_1:\mu \neq \mu_0\) & \(t_0>\frac{n}{2}\) - \(2P(T \geq t_0)=2\times 0.1336=0.2672\)

Conclusion:

  • If the p-value is smaller than 0.05, we reject the null hypothesis. The mean of … is not equal to / greater than / less than 30.
  • If the p-value is greater than 0.05, we do not reject the null hypothesis. The data are consistent with the mean of … being equal to 30.

13.2 Sign test for paired data

Sign test can be used to test differences between paired data when normality is not satisfied.

Hypothesis: \(H_0:p_+=\frac{1}{2}\) vs \(H_1:p_+>\frac{1}{2}\)

Assumptions: Differences \(D_i\) are independent.

## 
##  Exact binomial test
## 
## data:  t0 and n
## number of successes = 7, number of trials = 9, p-value = 0.08984
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
##  0.4503584 1.0000000
## sample estimates:
## probability of success 
##              0.7777778

Test statistic: Let \(T\) be the number of positive differences out of the 9 non-zero differences. Under \(H_0\), the test statistic follows a binomial distribution, \(T\sim B(9,\frac{1}{2})\).

Observed test statistic: We observed \(t_0=7\) positive differences in the sample.

P-value: probability of getting a test statistic as or more extreme than what we observed, \(P(T \geq 7)=1-P(T \leq 6)=1-pbinom(6,size=9,prob=\frac{1}{2})\approx 0.0898\)

Conclusion:

  • As the p-value is smaller than 0.05, we reject the null hypothesis. The population means of the two measurements are different.

  • As the p-value is greater than 0.05, we do not reject the null hypothesis. There is no evidence that the population means of the two measurements differ.
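The tail probability \(P(T\geq 7)\) for \(T\sim B(9,\frac{1}{2})\) is a short exact computation with binomial coefficients:

```python
from math import comb

n, t0 = 9, 7
# P(T >= 7) for T ~ B(9, 1/2): sum the upper-tail binomial probabilities.
p = sum(comb(n, k) for k in range(t0, n + 1)) / 2**n

print(p)   # 46/512, about 0.0898
```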

14 Wilcoxon signed-rank test - for one sample & paired data

Wilcoxon signed-rank test:

  • The ranks of the absolute differences are summed over the positive differences.
  • The hypotheses can be stated in terms of the mean or the median, since we assume that the distribution is symmetric.
  • The p-value will typically be smaller than the p-value of a sign test. By using the information in the ranks, the test becomes much more powerful in detecting differences from \(\mu_0\), and almost as powerful as the one-sample t-test.

14.1 Using wilcox.test

Hypothesis: \(H_0:\mu=30\) vs \(H_1:\mu>30, \mu<30,\mu \neq 30\)

Assumptions: Observations are independently sampled from a symmetric distribution.

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  diff
## V = 2886.5, p-value = 0.9395
## alternative hypothesis: true location is greater than 0

Test statistic:

  • for one-sided: \(W^+=\sum_{i:D_i>0}R_i\)
  • for two-sided: \(W=\min(W^+,W^-)\)

Observed test statistic:

  • for one-sided: \(t_0=w^+=2886.5\)
  • for two-sided: \(t_0=min(w^+,w^-)=2886.5\)

P-value:

  • \(H_1:\mu<\mu_0\) - \(P(W^+ \leq w^+)\)
  • \(H_1:\mu>\mu_0\) - \(P(W^+ \geq w^+)=P(W^+\geq 2886.5)=0.9395\)
  • \(H_1:\mu \neq \mu_0\) - \(2P(W^+ \leq w^+)\) when \(w^+\) is below \(E(W^+)\), or \(2P(W^+ \geq w^+)\) otherwise

Conclusion:

  • If the p-value is smaller than 0.05, we reject the null hypothesis. The mean of … is not equal to / greater than / less than 30.
  • If the p-value is greater than 0.05, we do not reject the null hypothesis. The data are consistent with the mean of … being equal to 30.

14.2 Using normal approximation

For large enough \(n\), we can use the normal distribution to approximate the distribution of the signed-rank test statistic: \(W^+\sim N\left(\frac{n(n+1)}{4},\frac{n(n+1)(2n+1)}{24}\right)\) approximately.

Hypothesis: \(H_0:\mu=30\) vs \(H_1:\mu>30, \mu<30,\mu \neq 30\)

Assumptions: Observations are independently sampled from a symmetric distribution.

Test statistic: \(W=\min(W^+,W^-)\) where \(W^+=\sum_{i:D_i>0}R_i\), \(W^-=\sum_{i:D_i<0}R_i\), \(D_i=X_i-30\) and \(R_i\) are the ranks of \(|D_1|,|D_2|,...,|D_n|\). Under \(H_0\), the test statistic \(W^+\sim WSR(138)\), a symmetric distribution with mean \(E(W^+)=\frac{n(n+1)}{4}=4795.5\) and variance \(Var(W^+)=\frac{n(n+1)(2n+1)}{24}=2.2139225\times 10^{5}\).

Observed test statistic: the test statistic is found by determining differences between observations and 30 \(D_i=X_i-\mu_0\), followed by assigning the signed ranks of \(D_i\). The sum of positive ranks (\(w^+\)) and the sum of negative ranks (\(w^-\)) are calculated.

  • We have a two-sided alternative, so the test statistic is \(w=min(w^+,w^-)=3978.5\).
  • We have a one-sided alternative, so the test statistic is \(w=w^+=3978.5\).

By using normal approximation, test statistic can be calculated by: \(t_0=\frac{w-E(w^+)}{\sqrt{Var(w^+)}}=\frac{3978.5-4795.5}{\sqrt{2.2139225\times 10^{5}}}=-1.74\)

P-value:

  • \(H_1:\mu<\mu_0\) - \(P(W^+ \leq 3978.5)\approx P\left(Z\leq\frac{3978.5-E(W^+)}{\sqrt{Var(W^+)}}\right)=P\left(Z\leq\frac{3978.5-4795.5}{\sqrt{2.2139225\times 10^{5}}}\right) =P(Z\leq -1.74)=0.0413\)
  • \(H_1:\mu>\mu_0\) - \(P(W^+ \geq 3978.5)\approx 1-P(Z\leq -1.74)=0.9587\)
  • \(H_1:\mu \neq \mu_0\) - \(2P(W^+ \leq 3978.5)\approx 2P(Z\leq -1.74)=0.0825\)

Conclusion:

  • If the p-value is smaller than 0.05, we reject the null hypothesis. The mean of … is not equal to / greater than / less than 30.
  • If the p-value is greater than 0.05, we do not reject the null hypothesis. The data are consistent with the mean of … being equal to 30.
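The normal-approximation numbers above can be verified with the standard library:

```python
import math
from statistics import NormalDist

n, w = 138, 3978.5
mean = n * (n + 1) / 4                # E(W+) = 4795.5
var = n * (n + 1) * (2 * n + 1) / 24  # Var(W+) = 221392.25

z = (w - mean) / math.sqrt(var)
p_two_sided = 2 * NormalDist().cdf(-abs(z))

print(z, p_two_sided)   # about -1.74 and 0.0825
```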

15 Wilcoxon rank-sum test - for two samples

Wilcoxon rank-sum test / Mann-Whitney U test:

  • A non-parametric test to compare means of two independent samples.
  • The ranks are summed over one of the samples.
  • Relaxes assumptions of normality and symmetry. It is valid for data from any distribution, and is much less sensitive to outliers than the two-sample t-test.
  • If the assumptions of the two-sample t-test hold, the Wilcoxon rank-sum test is less powerful.
  • Disadvantage: if one is primarily interested in differences in location between the two distributions, the Wilcoxon test has the disadvantage of also reacting to other differences between the distributions such as differences in shape.

15.1 Using wilcox.test - without ties

Hypothesis: \(H_0:\mu_X=\mu_Y\) vs \(H_1:\mu_X\neq \mu_Y\)

Assumptions: Observations \(X_i,Y_i\) are independent and two samples follow the same kind of distribution but differ by a shift.

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  male and female
## W = 2753, p-value = 0.0025
## alternative hypothesis: true location shift is not equal to 0

Test statistic: \(W=R_1+R_2+...+R_{n_x}\). Under \(H_0\), the test statistic follows the \(WRS(n_x,n_y)\) distribution.

Observed test statistic: \(w=r_1+r_2+...+r_{n_x}=2753\)

P-value:

  • for \(H_1:\mu_x>\mu_y\): \(P(W\geq w)\)
  • for \(H_1:\mu_x<\mu_y\): \(P(W\leq w)\)
  • for \(H_1:\mu_x \neq \mu_y\) and \(w>\frac{n_x(N+1)}{2}\): we would look in the upper tail, \(2P(W\geq w)\)
  • for \(H_1:\mu_x \neq \mu_y\) and \(w<\frac{n_x(N+1)}{2}\): because \(w=2753<\frac{n_x(N+1)}{2}=6279\), we look in the lower tail, \(2P(W\leq w)=0.0025\)

Decision:

  • If the p-value is smaller than 0.05, we reject the null hypothesis and conclude that the means of the two samples are different.
  • If the p-value is greater than 0.05, we do not reject the null hypothesis and conclude that the means of the two samples are the same.

15.2 Normal approximation - with ties

Hypothesis: \(H_0:\mu_X=\mu_Y\) vs \(H_1:\mu_X\neq \mu_Y\)

Assumptions: Observations \(X_i,Y_i\) are independent and two samples follow the same kind of distribution but differ by a shift.

Test statistic: when there are ties, the p-value is calculated using a normal approximation to the distribution of the test statistic. The standardised statistic \(T=\frac{W-E(W)}{\sqrt{Var(W)}}\) approximately follows a \(N(0,1)\) distribution, where \(E(W)=\frac{n_x(N+1)}{2}=6279\) and \(Var(W)=\frac{n_xn_y}{N(N-1)}(\sum^N_{i=1}r_i^2-\frac{N(N+1)^2}{4})=5.1831631\times 10^{4}\).

Observed test statistic: \(w=r_1+r_2+...+r_{n_x}=6980.5\)

P-value: As the exact \(WRS'(91,46)\) distribution with ties is unknown, we use a normal approximation to this distribution with \(E(W)=6279\) and \(Var(W)=5.1831631\times 10^{4}\) as above:

  • for \(H_1:\mu_x>\mu_y\): \(P(W\geq 6980.5)\approx P(Z\geq \frac{6980.5-E(W)}{\sqrt{Var(W)}})=P(Z\geq 3.08)= 0.001\)
  • for \(H_1:\mu_x<\mu_y\): \(P(W\leq 6980.5)\approx P(Z\leq \frac{6980.5-E(W)}{\sqrt{Var(W)}})=P(Z\leq 3.08)= 0.999\)
  • for \(H_1:\mu_x \neq \mu_y\): \(2P(W\geq 6980.5)\approx 2P(Z\geq |\frac{6980.5-E(W)}{\sqrt{Var(W)}}|)=2P(Z\geq 3.08)= 0.0021\)
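The normal-approximation p-value can be reproduced in R from the quantities given in this section:

```r
w    <- 6980.5      # observed rank sum (from this section)
EW   <- 6279        # E(W) = n_x * (N + 1) / 2
VarW <- 51831.631   # tie-corrected variance of W
z    <- (w - EW) / sqrt(VarW)           # standardised statistic, about 3.08
2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided p-value, about 0.002
```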

Decision:

  • If the p-value is smaller than 0.05, we reject the null hypothesis and conclude that the means of the two samples are different.
  • If the p-value is greater than 0.05, we do not reject the null hypothesis and conclude that the means of the two samples are the same.

16 Permutation test for continuous data

  • use the t-test statistic when no outliers exist; use the Wilcoxon rank-sum or median test statistic when outliers exist.
  • The permutation test uses the t-test test statistic but it does not use the t distribution.
  • The two-sample t-test is a parametric test where the test statistic is assumed to follow some distribution. The permutation test considers the \((n_1+n_2)!\) permutations of the labels from a single instance of the data.

16.1 Using t-test - without outliers

Hypothesis: \(H_0:\mu_x=\mu_y\) vs \(H_1:\mu_x >\mu_y\)

Assumptions: The observation \(X_1,X_2,...,X_{n_x},Y_1,Y_2,...,Y_{n_y}\) are exchangeable, i.e. swapping labels on observations keeps the data just as likely as the original.

Test statistic: as no outliers exist in the samples, the t-test statistic is used in the permutation test. It is calculated as \(T=\frac{\bar{X}-\bar{Y}}{S_p \sqrt{\frac{1}{n_x}+\frac{1}{n_y}}}\), where \(S^2_p=\frac{(n_x-1)S^2_x+(n_y-1)S^2_y}{n_x+n_y-2}\). Note that although the t-test statistic is used, the permutation test does not use the t distribution.

Observed test statistic: the original test statistic is \(t_0=\frac{\bar{x}-\bar{y}}{s_p \sqrt{\frac{1}{n_x}+\frac{1}{n_y}}}=-2.74\), where \(s^2_p=\frac{(n_x-1)s^2_x+(n_y-1)s^2_y}{n_x+n_y-2}\).

The data are permuted 10,000 times and the t-test statistic is recomputed for each permutation. The resulting test statistics are distributed as below.

P-value: the p-value, \(P(T\geq |-2.74|)=0.0072\), is the proportion of test statistics from randomly permuted data that are at least as extreme as the observed test statistic.
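The permutation procedure can be sketched in base R. The data here are simulated; sample sizes, names, and values are illustrative only:

```r
set.seed(2)
x <- rnorm(30, mean = 12); y <- rnorm(30, mean = 10)   # simulated samples
t0 <- t.test(x, y, var.equal = TRUE)$statistic         # observed t statistic
combined <- c(x, y)
perm_t <- replicate(10000, {
  shuffled <- sample(combined)                         # permute the labels
  t.test(shuffled[1:30], shuffled[-(1:30)], var.equal = TRUE)$statistic
})
mean(abs(perm_t) >= abs(t0))                           # permutation p-value
```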

Decision:

  • If the p-value is greater than 0.05, the null hypothesis is not rejected. We conclude that the means of the two samples are the same.
  • If the p-value is smaller than 0.05, the null hypothesis is rejected. We conclude that the means of the two samples are different.

16.2 Using Wilcoxon rank-sum - with outliers

Hypothesis: \(H_0:\mu_x=\mu_y\) vs \(H_1:\mu_x >\mu_y\)

Assumptions: The observation \(X_1,X_2,...,X_{n_x},Y_1,Y_2,...,Y_{n_y}\) are exchangeable, i.e. swapping labels on observations keeps the data just as likely as the original.

Test statistic: there are outliers within the samples, and a t-test on the means can be significantly affected by them. Therefore, instead of the t-test statistic, the Wilcoxon rank-sum test statistic, which is robust to outliers, is used in the permutation test: \(W=R_1+R_2+...+R_{n_x}\)

Observed test statistic: the original test statistic is \(w=r_1+r_2+...+r_{n_x}=1433\)

The data are permuted 10,000 times and the Wilcoxon rank-sum statistic is recomputed for each permutation. The resulting test statistics are distributed as below.

P-value: the p-value, \(P(W\geq 1433)=0.9985,\) is the proportion of test statistics from randomly permuted data that are at least as extreme as the observed test statistic \(w=1433\).

Decision:

  • If the p-value is greater than 0.05, the null hypothesis is not rejected. We conclude that the means of the two samples are the same.
  • If the p-value is smaller than 0.05, the null hypothesis is rejected. We conclude that the means of the two samples are different.

16.3 Using Median-absolute-deviation (MAD) - with outliers

Hypothesis: \(H_0:\mu_x=\mu_y\) vs \(H_1:\mu_x >\mu_y\)

Assumptions: The observation \(X_1,X_2,...,X_{n_x},Y_1,Y_2,...,Y_{n_y}\) are exchangeable, i.e. swapping labels on observations keeps the data just as likely as the original.

Test statistic: there are outliers within the samples, and a t-test on the means can be significantly affected by them. Therefore, instead of the t-test statistic, the difference in sample medians, which is robust to outliers, is used in the permutation test: \(T=\widetilde X-\widetilde Y\).

Observed test statistic: the original test statistic is \(t_0=\widetilde x-\widetilde y=2\)

The data are permuted 10,000 times and the median-difference statistic is recomputed for each permutation. The resulting test statistics are distributed as below.

P-value: the p-value, \(P(T\geq 2)=0.0365,\) is the proportion of test statistics from randomly permuted data that are at least as extreme as the observed test statistic \(t_0=2\).

Decision:

  • If the p-value is greater than 0.05, the null hypothesis is not rejected. We conclude that the means of the two samples are the same.
  • If the p-value is smaller than 0.05, the null hypothesis is rejected. We conclude that the means of the two samples are different.

16.4 For one sample

Hypothesis: \(H_0: \mu =30\) vs \(H_1: \mu >30\), \(\mu<30\), or \(\mu\neq 30\)

Assumptions: The observation \(X_1,X_2,...,X_{n_x}\) are exchangeable, i.e. swapping labels on observations keeps the data just as likely as the original.

Test statistic: the Wilcoxon signed-rank test statistic is used, calculated by \(T=\sum^n_{i=1}R_i\times sign(d_i)\), where \(R_i\) is the rank of \(|d_i|\)

Observed test statistic: the original test statistic is -1.2354766

P-value: the p-value, \(P(|T|\geq|t_0|)=0.2177,\) is the proportion of test statistics from randomly permuted data that are at least as extreme as the observed test statistic.

Decision:

  • If the p-value is greater than 0.05, the null hypothesis is not rejected. We conclude that the mean is equal to 30.
  • If the p-value is smaller than 0.05, the null hypothesis is rejected. We conclude that the mean is not equal to 30.

17 Bootstrapping

Bootstrapping:

  • repeatedly resample from the sample (with replacement).
  • the bootstrap confidence interval is given by the quantiles of the bootstrap distribution: quantile(result, c(0.025, 0.975))
  • allows us to make inferences about the population when no information about the population distribution is available.

Bootstrapping is useful when:

  • the theoretical distribution of a statistic is complicated or unknown.
  • the sample size is too small to make any sensible parametric inferences about the parameter.

Advantages:

  • Bootstrapping frees us from making parametric assumptions to carry out inferences.
  • Provides answers to problems for which analytic solutions are impossible.
  • Can be used to verify, or check the stability of results.
  • Asymptotically consistent.

The data are resampled 10,000 times from the sample with replacement. Based on the bootstrap results, our 95% confidence interval is between 25.79 and 30.98.
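A minimal bootstrap for a sample mean in base R; the data here are simulated, and result matches the quantile call mentioned above:

```r
set.seed(3)
x <- rnorm(50, mean = 28, sd = 9)    # simulated sample, not the survey data
result <- replicate(10000, mean(sample(x, replace = TRUE)))  # bootstrap means
quantile(result, c(0.025, 0.975))    # 95% bootstrap confidence interval
```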

18 Notched boxplot

The upper and lower edges of the notches are at \(median \pm 1.58\times \frac{IQR}{\sqrt n}\)

Rule of thumb: if the notches of the two boxes do not overlap, this suggests that the medians are significantly different.

19 Q-Q plot

The data doesn’t follow a normal distribution if:

  • there is a bend at the lower end of the plot (e.g. the data are bounded below at zero);
  • the points are not close enough to the line to be considered ‘normal’.

Note: even when the data are not normal, the central limit theorem can be applied to give approximately normal sample means when the sample size is large.

20 Types of errors

types of errors:

  • type I error / false positive: concluding \(\theta \neq 0\) when in fact \(\theta=0\)
  • type II error / false negative: concluding \(\theta=0\) when in fact \(\theta\neq 0\)

Error rates:

  • False positive rate: the rate at which truly null results (\(\theta=0\)) are called significant.
  • Family wise error rate (FWER): the probability of at least one false positive; with m tests, \(FWER=P(V\geq 1)=1-(1-\alpha)^m\)
  • False discovery rate (FDR): the rate at which claims of significance are false

20.1 Bonferroni correction - control FWER

  • set \(\alpha^*=\frac{\alpha}{m}\)
  • call all p-values less than \(\alpha^*\) significant
  • pros: easy to calculate, conservative
  • cons: may be very conservative

20.2 Benjamini-Hochberg procedure - control FDR

steps:

  1. order the p-values from smallest to largest \(p_{(1)}\leq p_{(2)}\leq...\leq p_{(m)}\)
  2. find \(j^*=\max~j\) such that \(p_{(j)}\leq \frac{j}{m}\alpha\)
  3. reject all \(H_{0(j)}\) where \(p_{(j)}\leq \frac{j^*}{m}\alpha\)

pros & cons:

  • pros: still pretty easy to calculate, less conservative than bonferroni
  • cons: allows for more false positives, may behave strangely under dependence
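Both corrections are available in base R through p.adjust; the p-values below are made up purely for illustration:

```r
pvals <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205)
p.adjust(pvals, method = "bonferroni")  # controls FWER
p.adjust(pvals, method = "BH")          # Benjamini-Hochberg, controls FDR
```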

21 ANOVA

Hypothesis: \(H_0:\mu_1=\mu_2=...=\mu_g\) vs \(H_1:\) at least one \(\mu_i \neq \mu_j\).

Assumptions:

  • Observations are independent within each of the g samples
  • Each of the g populations has the same variance, \(\sigma_1^2=\sigma^2_2=...=\sigma_g^2=\sigma^2\)
  • Each of the g populations are normally distributed (or the sample sizes are large enough such that the central limit theorem can be applied.)

Check assumptions:

  • Independence: Dots are randomly distributed without any pattern in the residual plot, suggesting the independence of observations.
  • Homogeneous variance: dots are evenly and randomly distributed in the residual plot, showing constant variance for each of the g populations.
  • Normality: The Q-Q plot shows dots being reasonably close to the Q-Q line; Although the dots are not perfectly close to Q-Q line, the sample sizes are large enough such that the central limit theorem can be applied.

Test statistic: \(T=\frac{Treatment~Mean~Sq}{Residual~Mean~Sq}=\frac{\sum^g_{i=1}n_i(\bar Y_{i\bullet}-\bar Y_{\bullet\bullet})^2/(g-1)}{\hat\sigma^2}=\frac{\sum^g_{i=1}n_i(\bar Y_{i\bullet}-\bar Y_{\bullet\bullet})^2/(g-1)}{\sum^g_{i=1}\sum^{n_i}_{j=1}(Y_{ij}-\bar Y_{i\bullet})^2/(N-g)}\). Under \(H_0,\) \(T\)~\(F_{g-1,N-g}\) where g is the number of groups.

Observed test statistics: \(t_0=\frac{Treatment~Mean~Sq}{Residual~Mean~Sq}=\)

P-value: \(P(T\geq t_0)=P(F_{g-1,N-g}\geq t_0)\)

Decision:

  • the p-value is ____, which is less than \(\alpha\). We reject the null hypothesis and conclude that the population mean of at least one group is significantly different to the others.
  • the p-value is ___, which is greater than \(\alpha\). We do not reject the null hypothesis and conclude that there is no significant difference between the population means.

21.1 Dot notation

  • total for sample i is \(\sum^{n_i}_{j=1}y_{ij}=y_{i\bullet}\)
  • average for sample i is \(\frac{1}{n_i}\sum^{n_i}_{j=1}y_{ij}=\bar y_{i\bullet}\)
  • grand total of all observations is \(\sum^g_{i=1}\sum^{n_i}_{j=1}y_{ij}=y_{\bullet\bullet}\), where \(N=n_1+...+n_g\) is the total number of observations
  • the i-th group’s sample variance: \(s^2_i=\frac{1}{n_i-1}\sum^{n_i}_{j=1}(y_{ij}-\bar y_{i\bullet})^2\)

21.2 ANOVA decomposition

  • Total Sum of Squares: \(Total~SS=\sum^g_{i=1}\sum^{n_i}_{j=1}(y_{ij}-\bar y_{\bullet\bullet})^2=Residual~SS+Treatment~SS\)
  • Sample variance: under \(H_0\): \(\hat \sigma^2_0=\frac{Total~SS}{N-1}=\frac{\sum^g_{i=1}\sum^{n_i}_{j=1}(Y_{ij}-\bar Y_{\bullet\bullet})^2}{N-1}\); under \(H_1\): \(\hat\sigma^2=\frac{\sum^g_{i=1}\sum^{n_i}_{j=1}(Y_{ij}-\bar Y_{i\bullet})^2}{N-g}\)=Residual Mean Square
  • Residual sum of squares: \(Residual~SS=\sum^g_{i=1}\sum^{n_i}_{j=1}(Y_{ij}-\bar Y_{i\bullet})^2=\sum^g_{i=1}(n_i-1)S^2_i\) ~ \(\sigma^2\chi^2_{N-g}\)
  • Residual mean square: \(\hat \sigma^2=\frac{\sum^g_{i=1}\sum^{n_i}_{j=1}(Y_{ij}-\bar Y_{i\bullet})^2}{N-g}\)~\((\frac{\sigma^2}{N-g})\chi^2_{N-g}\)
  • Treatment sum of squares: \(Treatment~SS=\sum^g_{i=1}n_i(\bar Y_{i\bullet}-\bar Y_{\bullet\bullet})^2\), from the decomposition \(\sum^g_{i=1}\sum^{n_i}_{j=1}(Y_{ij}-\bar Y_{\bullet\bullet})^2=\sum^g_{i=1}\sum^{n_i}_{j=1}(Y_{ij}-\bar Y_{i\bullet})^2+\sum^g_{i=1}n_i(\bar Y_{i\bullet}-\bar Y_{\bullet\bullet})^2\)
  • Treatment mean square: \(\frac{Treatment~SS}{g-1}=\frac{\sum^g_{i=1}n_i(\bar Y_{i\bullet}-\bar Y_{\bullet\bullet})^2}{g-1}\)

21.3 F-statistic

\(T=\frac{Treatment~Mean~Sq}{Residual~Mean~Sq}=\frac{\sum^g_{i=1}n_i(\bar Y_{i\bullet}-\bar Y_{\bullet\bullet})^2/(g-1)}{\hat\sigma^2}=\frac{\sum^g_{i=1}n_i(\bar Y_{i\bullet}-\bar Y_{\bullet\bullet})^2/(g-1)}{\sum^g_{i=1}\sum^{n_i}_{j=1}(Y_{ij}-\bar Y_{i\bullet})^2/(N-g)}\). Under \(H_0,\) \(T\)~\(F_{g-1,N-g}\) where g is the number of groups.

  • the denominator (residual mean square) is always an unbiased estimator of \(\sigma^2\), regardless of whether \(H_0\) is true or not.
  • the numerator (treatment mean square) is only an unbiased estimator of \(\sigma^2\) if \(H_0\) is true; otherwise it tends to be bigger.

21.4 Contrast

A contrast is a linear combination where the coefficients add to 0. In an ANOVA, a contrast is a linear combination of means.

  • population contrast: e.g. \(\mu_1-\mu_2\)
  • sample contrast: e.g. \(\bar y_{1\bullet}-\bar y_{2\bullet}\) and \(\bar Y_{1\bullet}-\bar Y_{2\bullet}\)

Under \(H_0:\mu_1=...=\mu_g\):

  • all population contrasts are zero: \(\sum^g_{i=1}c_i\mu_i=\sum^g_{i=1}c_i\mu=\mu\sum^g_{i=1}c_i=0\)
  • all sample contrasts have expectation zero: \(E(\sum^g_{i=1}c_i\bar Y_{i\bullet})=\sum^g_{i=1}c_i\mu_i=0\)
  • Therefore \(H_0\) can be “all population contrasts are zero”.

T-test for individual contrast - generalize the two-sample t-statistic: \(T=\frac{\sum^g_{i=1}c_i\bar Y_{i\bullet}-\sum^g_{i=1}c_i\mu_i}{\hat \sigma \sqrt{\sum^g_{i=1}\frac{c^2_i}{n_i}}}\) ~ \(t_{N-g}\), where \(\hat \sigma=\sqrt {ResMS}\)~\(\sqrt{\chi^2_{N-g}/(N-g)}\) under \(H_0(\sum^g_{i=1}c_i\mu_i=0)\)

  • this is better than an ordinary two-sample t-test because it has a smaller standard error and therefore a better estimate of \(\sigma\).

Standard error of contrasts: \(\hat \sigma \sqrt{\sum_i \frac{c^2_i}{n_i}}\), where \(\hat \sigma=\sqrt{Residual~mean~square}\)

Confidence interval of contrasts: \(\sum_i c_i\bar y_{i\bullet}\pm t^*(\hat \sigma \sqrt{\sum_i \frac{c^2_i}{n_i}})\), where \(\hat \sigma=\sqrt{Residual~mean~square}\)

21.6 Post hoc tests

Multiple comparisons of CIs:

  • Bonferroni correction
  • Tukey’s method
  • Scheffe’s method

21.6.1 Bonferroni correction

How: Divide the original alpha level (most likely set to 0.05) by the number of tests being performed. The result is a Bonferroni-corrected significance threshold that an individual test’s p-value must fall below to be classed as significant.

Why: When conducting multiple analyses on the same dependent variable, the chance of committing a Type I error increases, making it more likely that a significant result arises by pure chance. A Bonferroni correction is conducted to protect against this inflation of Type I error.

test(pairs(flicker_em, adjust = "none"))
test(pairs(flicker_em, adjust = "bonferroni"))

  • is generally conservative, i.e. p-values and confidence intervals may be larger than they really need to be.

21.6.2 Tukey’s method

confint(pairs(flicker_em, adjust = "tukey"))
test(pairs(flicker_em, adjust = "tukey"))

  • more accurate and less conservative than Bonferroni for simultaneous confidence interval comparisons when sample sizes are equal
  • when sample sizes are unequal, Tukey’s procedure is conservative, though the intervals may still be narrower than those using the Bonferroni method.
  • the smallest p-value is the F-test p-value

21.6.3 Scheffe’s method

confint(pairs(flicker_em, adjust = "scheffe"))
test(pairs(flicker_em, adjust = "scheffe"))

\(t^*_{Sch}(\alpha)=\sqrt{(g-1)F_{g-1,N-g}(\alpha)}=\sqrt{(g-1)*qf(1-\alpha,g-1,N-g)}\)

CI: \(\sum^g_{i=1}c_i\bar Y_{i\bullet}\pm t^*_{Sch}(\alpha)\hat \sigma \sqrt{\sum^g_{i=1}\frac{c^2_i}{n_i}}\)

  • With \(t^*_{Sch}(\alpha)\), the probability that the intervals for all sample contrasts include their true population values is exactly \(1-\alpha\)
  • the smallest p-value is the F-test p-value

21.7 Relaxing homogeneous variance

Pairwise Welch test:

  • Assumption: each sample is normal, with possibly different variances \(\sigma_X^2,\sigma_Y^2\) and different means, and all random variables are independent
  • Test statistic: \(T=\frac{\bar X-\bar Y}{\sqrt{\frac{S^2_X}{m}+\frac{S_Y^2}{n}}}\), which is approximately \(t_{d^*(m,n,\sigma_X,\sigma_Y)}\) under \(H_0\).

21.8 Relaxing normality

A weaker set of assumptions:

  • all observations come from the same distribution

21.8.1 Permutation test with ANOVA

Permutation test avoids making normality assumption by resampling.

How: randomly permute the observation vector, keeping the factor vector fixed. Repeat the permutation a large number of times. The observed proportion of times the permuted statistic exceeds the observed test statistic is an estimate of the exact p-value.

21.8.2 Kruskal-Wallis test

  • replace each observation by its global rank, then compute the F-ratio as usual on the ranks
  • p-value is obtained using a permutation test or a “large-sample” \(\chi^2\) approximation.

Hypothesis: \(H_0:\) the response variable is distributed identically for all groups vs \(H_1:\) the response variable is systematically higher for at least one group

Assumptions: Observations are independent within each group and groups are independent of each other. The different groups follow the same distribution (differing only by the location parameter).

Test statistic: \(T=\frac{Treatment~SS~of~ranks}{Variance~of~all~ranks}\), which has an approximate \(\chi^2_{g-1}\) distribution under \(H_0\).

P-value: \(P(T\geq t_0)=P(\chi^2_{g-1}\geq t_0)\)

Decision:

  • the p-value is less than \(\alpha\), we reject the null hypothesis and conclude that the population mean of at least one group is significantly different to the others.
  • the p-value is greater than \(\alpha\), we do not reject the null hypothesis and conclude that there is no significant difference between the population means.
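In R the test is available as kruskal.test. A sketch on simulated skewed data (names are illustrative):

```r
set.seed(5)
dat <- data.frame(
  y     = c(rexp(20, 1), rexp(20, 0.5), rexp(20, 0.8)),  # skewed responses
  group = rep(c("A", "B", "C"), each = 20)
)
kruskal.test(y ~ group, data = dat)  # chi-squared approximation on g-1 df
```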

21.9 Two-way ANOVA

  • residual sum of squares always follows a \(\sigma^2\chi^2_{(n-1)(g-1)}\) distribution
  • under \(H_0\), the treatment sum of squares follows a \(\sigma^2\chi^2_{g-1}\) distribution. Otherwise, it tends to take larger value.

21.9.1 Blocking:

  • a two-way ANOVA with blocking can be thought of as a generalization of the paired t-test where each pair is a block
  • Although the treatment sum of squares and block sum of squares are mathematically identical, they are playing very different scientific roles.

21.9.2 Friedman test: adjusting for blocks using ranks

  • each observation is replaced by its within-block rank
  • a one-way ANOVA F-test is performed on the ranks
  • Test statistic: \(T=\frac{Treatment~sum~squares~of~ranks}{Total~sum~of~squares~of~ranks/n(g-1)}\), which follows an approximate \(\chi^2_{g-1}\) distribution under \(H_0\).
  • Permutation test is applicable

21.9.3 Two-way ANOVA with interaction

  • if the interaction effect is significant while the main effects of treatments are not, we must still retain the full model.

interaction plot:

  • If there is no interaction the traces should be “roughly parallel”.
  • If there is an interaction, the traces may cross or deviate from parallelism in some other way

22 Linear regression

22.1 Hypothesis

Hypothesis:

  • SLR: \(H_0:\beta_1=0\) vs \(H_1:\beta_1 \neq 0, \beta_1>0,\beta_1<0\)
  • MLR: \(H_0:\beta_1=\beta_2=...=\beta_n=0\) vs \(H_1:\) at least one of the coefficients is not zero

Assumptions:

  • Linearity: the relationship between Y and x is linear
  • Independence: all the errors are independent of each other
  • Homoskedasticity: the errors have constant variance \(Var(\epsilon_i)=\sigma^2\) for all \(i=1,2,...,n\)
  • Normality: the errors follow a normal distribution

Checking assumptions:

  • Linearity: residuals should be symmetrically distributed above and below zero; a curved pattern in the residuals is evidence for non-linearity.
  • Independence: data-driven
  • Homoskedasticity: residuals should be randomly distributed without any pattern; A spread-out pattern of residuals is evidence for heteroskedasticity.
  • Normality: Apart from a few outliers in the tails, the majority of points lie close to the diagonal line in the QQ plot. Hence, the normality assumption for the residuals is reasonably well satisfied. Additionally, we have quite a large sample size, so we can also rely on the central limit theorem to give us approximately valid inferences.

Test statistic: \(T=\frac{\hat \beta_1}{SE(\hat\beta_1)}\)~\(t_{n-2}\) under \(H_0\).

P-value:

  • \(P(t_{n-2}\geq t_0)\) for \(H_1:\beta_1>0\)
  • \(P(t_{n-2}\leq t_0)\) for \(H_1:\beta_1<0\)
  • \(2P(t_{n-2}\geq|t_0|)\) for \(H_1:\beta_1\neq 0\)

Conclusion:

  • p-value is less than \(\alpha\), we reject null hypothesis and conclude that \(\beta_1\) is not equal to zero.
  • p-value is greater than \(\alpha\), we do not reject null hypothesis and conclude that \(\beta_1\) is equal to zero.

22.2 Confidence Interval of coefficient

\(\hat \beta_1 \pm t^* \times SE(\hat \beta_1)\)
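In R, lm fits the model and confint returns exactly this interval. The data below are simulated and the names are illustrative:

```r
set.seed(6)
x <- runif(40, 0, 10)
y <- 2 + 0.5 * x + rnorm(40)   # simulated linear relationship
fit <- lm(y ~ x)
summary(fit)$coefficients      # estimates, SEs, t statistics, p-values
confint(fit, level = 0.95)     # beta_hat +/- t* x SE(beta_hat)
```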

22.3 R-squared - in-sample performance

  • Adding a variable will always increase \(R^2\)
  • \(R^2=0.5\): 50% of the observed variation in y is explained by the model.
  • Doesn’t protect against overfitting

22.4 Interpreting coefficients

  • Linear-linear: on average, a one unit increase in x will result in a \(\beta_1\) unit increase/decrease in y, holding all other variables constant.
  • Linear-log: on average, a one percent increase in x will result in a \(\beta_1/100\) unit change in y, holding all other variables constant.
  • Log-linear: on average, a one unit increase in x will result in a \(\beta_1 \times 100\)% change in y, holding all other variables constant.
  • Log-log: on average, a one percent increase in x will result in a \(\beta_1\)% change in y, holding all other variables constant.

22.5 MLR coefficients

\(\hat \beta=(X^TX)^{-1}X^Ty\)

22.6 Stepwise selection

Backward selection:

  1. Start with the model containing all possible explanatory variables
  2. For each variable in turn, investigate the effect of removing it
  3. Remove the least informative variable, unless it is nonetheless supplying significant information about the response
  4. Repeat steps 2-3 until all variables in the current model are important.

Forward selection:

  1. Start with a null model containing only the constant.
  2. For each variable not in the current model, investigate the effect of including it.
  3. Include the most statistically significant variable not currently in the model (unless no significant variable exists)
  4. Repeat steps 2-3 until the model no longer changes.

22.7 Prediction interval vs Confidence interval

Prediction interval: \(\hat y_0\pm t^* \sqrt{Var(\hat e)}\)

  • Used when predicting an individual y at a given x; gives a 95% prediction interval for that prediction.

Confidence interval: \(\hat y_0 \pm t^* \sqrt{Var(\hat y_0)}\)

  • Used when estimating the average y at a given x; gives a 95% confidence interval for that estimate.

22.8 Out of sample performance

  • Root mean square error: \(RMSE=\sqrt{\frac{\sum^n_{i=1}(y_i-\hat y_i)^2}{n}}\)
  • Mean absolute error: \(MAE=\frac{\sum^m_{i=1}|y_i-\hat y_i|}{m}\)

22.9 K-fold cross validation:

  1. Randomly divide data into k subsets of equal size
  2. Estimate the model by leaving one subset out
  3. Use the model to predict observations left out
  4. Compute error rates on the left out set
  5. Repeat k times, once for each of the subsets, and average the error rate over the k runs

NOTE:

  • Bias-variance trade-off: a smaller k gives larger bias but smaller variance.
  • CV is computationally intensive.
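The five steps can be sketched in base R for a simple linear model; the data are simulated and the names are illustrative:

```r
set.seed(7)
n <- 100
x <- runif(n); y <- 1 + 2 * x + rnorm(n, sd = 0.5)  # simulated data
k <- 5
fold <- sample(rep(1:k, length.out = n))            # random fold assignment
rmse <- sapply(1:k, function(j) {
  fit  <- lm(y ~ x, subset = fold != j)             # fit on the other k-1 folds
  pred <- predict(fit, newdata = data.frame(x = x[fold == j]))
  sqrt(mean((y[fold == j] - pred)^2))               # error on the held-out fold
})
mean(rmse)  # CV estimate of out-of-sample RMSE
```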

23 Logistic regression

Logistic regression: used to classify binary observations by modelling the probability that Y is 1: \(P(Y=1|x_i)=\frac{exp(x_i^T\beta)}{1+exp(x_i^T\beta)}\). It isn’t feasible when:

  • we have high class separation in the data
  • a non-linear combination of predictors influences the response

Binary decision:

  • if \(P(Y=1|x_0)>0.5\), the prediction \(\hat Y=1\)
  • if \(P(Y=1|x_0)<0.5\), the prediction \(\hat Y=0\)
  • \(P(Y=1|x_0)=0.5\) gives the decision boundary

23.1 Hypothesis

Hypotheses: \(H_0:\beta_{age}=0\) vs \(H_1:\beta_{age}\neq 0\)

Test statistic: \(T=\frac{\hat \beta_{age}}{SE(\hat \beta_{age})}\)~\(N(0,1)\) under \(H_0\)

P-value: \(2P(T\geq|t_0|)=2P(Z\geq |t_0|)=\)

Conclusion:

  • the p-value is less than \(\alpha\), we reject the null hypothesis and conclude that x is a significant predictor for y.
  • the p-value is greater than \(\alpha\), we do not reject the null hypothesis and conclude that x is not a significant predictor for y.
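In R the model is fitted with glm and a binomial family. The data below are simulated; the predictor name age matches the section's example coefficient but the values are made up:

```r
set.seed(8)
age <- runif(100, 20, 70)
p   <- plogis(-4 + 0.1 * age)            # true probabilities on the logit scale
y   <- rbinom(100, 1, p)                 # simulated binary response
fit <- glm(y ~ age, family = binomial)
summary(fit)$coefficients                # Wald z statistics and p-values
yhat <- as.integer(predict(fit, type = "response") > 0.5)  # binary decision
```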

23.2 Logit function - interpretation

Log-odds: \(logit(p)=log(\frac{p}{1-p})\)

Formula: \(logit(p)=\beta_0+\beta_1 x_1+\beta_2 x_2\)

Interpretation:

\(\beta_0:\) the log-odds of y for an object with \(x_1=0,x_2=0\)

for categorical \(x_1\): \(\beta_1-\) holding all other features constant, \(\beta_1\) represents the difference in the log-odds between \(x_1=1\) and \(x_1=0\).

  • if \(\beta_1<0\), the odds of y are lower for \(x_1=1\) than for \(x_1=0\)
  • if \(\beta_1>0\), the odds of y are higher for \(x_1=1\) than for \(x_1=0\)

for numeric \(x_2\): \(\beta_2-\) holding all other features constant,

  • if \(\beta_2<0\), a one unit increase in \(x_2\) will result in a decrease of \(|\beta_2|\) in the log-odds of y.
  • if \(\beta_2>0\), a one unit increase in \(x_2\) will result in an increase of \(\beta_2\) in the log-odds of y.

Prediction: convert the fitted log-odds to a probability \(\hat p\) via the inverse logit, then round \(\hat p\) to 0 or 1.

23.3 Predictive performance

23.3.1 Insample performance

Resubstitution error rate: measures the proportion of data points we predict incorrectly when predicting the same points used to fit the model - \(r=\frac{1}{n}\sum^n_{i=1}1(y_i\neq\hat y_i)\) - interpretation: we fail to correctly classify \(r\times 100\)% of the observations.

  • Sensitivity / Recall\(=\frac{TP}{P}=\frac{TP}{TP+FN}\)
  • Specificity \(=\frac{TN}{N}=\frac{TN}{TN+FP}\)
  • Precision / Positive predictive value \(=\frac{TP}{TP+FP}\)
  • Negative predictive value \(=\frac{TN}{TN+FN}\)

23.3.2 Out of sample performance - CV error rate

CV error rate: split the data into k folds. Treat the first fold as a test set and fit the method on the remaining k-1 folds. The misclassification error rate is calculated on the observations in the held-out fold. Repeat the procedure k times and take the average error rate as the CV score.

  • Bias-variance trade-off: a large k has a low bias and high variance.

24 Decision tree & Random forest

24.1 Decision tree

Rationale: determines the predicted outcome based on a series of questions and conditions, classifying observations into 2 or more classes.

Hyperparameter: the complexity parameter determines whether a proposed new split sufficiently improves predictive power; it helps to avoid overfitting.

Weakness:

  1. can massively overfit the data;
  2. The selected tree might be very sensitive to the complexity penalty;
  3. Can only make decisions parallel to axes.

24.2 Random forest

Steps:

  1. Choose hyperparameters: number of decision trees to grow & number of variables to consider in each tree
  2. Randomly select the rows of the data frame with replacement.
  3. Randomly select the appropriate number of variables from the data frame.
  4. Build a decision tree on the resulting data set.
  5. Repeat the above steps a large number of times.
  6. Make a prediction by majority rule (i.e. run new observation through all the trees in the forest and see which class is predicted most often.)

Weakness: it is a black-box procedure and therefore not easily interpretable.

25 K-nearest neighbours

  • knn works well with a small number of input variables, but struggles when the number of inputs is very large.
  • knn makes no assumptions about the functional form of the problem being solved.
  • knn can be used in both classification and regression.
  • knn performs much better if all of the data have the same scale.

Algorithm:

  1. Calculate the Euclidean distance, \(d(x,z)=\sqrt{(x_1-z_1)^2+(x_2-z_2)^2+...+(x_p-z_p)^2}\), between any two points.
  2. Find the k observations in the training data that are closest to x according to Euclidean distance.
  3. Define an aggregation function \(f\) and compute \(\hat y=f(x)\) for the k values of y.
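The algorithm above can be sketched with knn from the class package (assumed to be installed); the data are simulated:

```r
library(class)                                   # provides knn()
set.seed(9)
train <- matrix(rnorm(200), ncol = 2)            # 100 training points
cl    <- factor(rep(c("a", "b"), each = 50))     # their class labels
train[cl == "b", ] <- train[cl == "b", ] + 1.5   # shift class b away from a
test  <- matrix(rnorm(40), ncol = 2)             # 20 new points to classify
knn(train, test, cl, k = 5)                      # predicted class per test point
```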

KNN properties:

  • Non-parametric algorithm (doesn’t assume the data follows any particular shape).
  • Adv: 1) Doesn’t require any training time; 2) Performance improves as the sample size grows; 3) Analytically tractable and simple implementation; 4) uses local information, and is highly adaptive; 5) Can be easily parallelised
  • Disadv: 1) Computationally intensive when predicting new observations as the distance between every two points need to be calculated and reordered; 2) Usefulness depends on the geometry of the data; 3) Large storage requirement as the whole data set needs to be stored;

26 Checking assumptions:

  • Linearity: residuals should be symmetrically distributed above and below zero; a curved pattern in the residuals is evidence for non-linearity.
  • Independence: data-driven
  • Homoskedasticity: residuals should be randomly distributed without any pattern; A spread-out pattern of residuals is evidence for heteroskedasticity.
  • Normality: Apart from a few outliers in the tails, the majority of points lie close to the diagonal line in the QQ plot. Hence, the normality assumption for the residuals is reasonably well satisfied. Additionally, we have quite a large sample size (\(n\geq 30\)), so we can also rely on the central limit theorem to give us approximately valid inferences.

27 What is permutation?

Randomly permute the observation vector, keeping the factor vector fixed. Repeat the permutation a large number of times. The observed proportion of times the permuted statistic exceeds the observed test statistic is an estimate of the exact p-value.

28 Clustering

28.1 Hierarchical clustering

Hierarchical clustering methods produce a tree or dendrogram.

  • bottom-up: agglomerative clustering
  • top-down: divisive clustering

Advantages of hierarchical clustering:

  • Don’t need to know how many clusters you’re after.
  • Can cut hierarchy at any level to get any number of clusters.
  • Easy to interpret hierarchy for particular applications.
  • Deterministic.
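In base R, agglomerative clustering is a dist plus hclust call; the built-in USArrests data set is used here purely for illustration:

```r
hc <- hclust(dist(scale(USArrests)), method = "complete")  # bottom-up clustering
plot(hc)           # dendrogram
cutree(hc, k = 4)  # cut the hierarchy to obtain 4 clusters
```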

28.2 K-means clustering

  1. Randomly assign a number, from 1 to K, to each of the observations. (These serve as initial cluster assignments for the observations.)
  2. Iterate until the cluster assignments stop changing:
     a. For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p covariate means for the observations in the kth cluster.
     b. Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).

Advantages of k-means clustering:

  • Can be much faster than hierarchical clustering, depending on data.
  • Nice theoretical framework.
  • Can incorporate new data and reform clusters easily.
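A minimal kmeans call on simulated data with two well-separated clusters (names are illustrative):

```r
set.seed(10)
x <- rbind(matrix(rnorm(50), ncol = 2),
           matrix(rnorm(50, mean = 3), ncol = 2))  # two simulated clusters
km <- kmeans(x, centers = 2, nstart = 20)  # nstart: several random initialisations
km$cluster  # cluster assignment for each observation
km$centers  # the two centroids
```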

29 PCA

  • Principal component analysis (PCA) produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated.
  • It is an unsupervised learning method.

Why?

  • We may have too many predictors for a regression. Instead, we can use the first few principal components. PCA regression.
  • Understanding relationships between variables - similar to a correlation matrix.
  • Data visualisation: we can plot a small number of variables more easily than a large number of variables

Interpretation: examine the magnitude and direction of the coefficients for the original variables. The larger the absolute value of the coefficient, the more important the corresponding variable is in calculating the component.
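In R, prcomp performs PCA; scaling the variables first is usually advisable. The built-in iris measurements are used here purely for illustration:

```r
pc <- prcomp(iris[, 1:4], scale. = TRUE)  # centre and scale, then rotate
summary(pc)   # proportion of variance explained by each component
pc$rotation   # loadings: coefficients of the original variables
```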